SECOND ORDER BEHAVIOR OF PATTERN SEARCH

MARK  A.  ABRAMSON  * 

Abstract. Previous analyses of pattern search algorithms for unconstrained and linearly
constrained minimization have focused on proving convergence of a subsequence of iterates to a
limit point satisfying either directional or first-order necessary conditions for optimality,
depending on the smoothness of the objective function in a neighborhood of the limit point. Even
though pattern search methods require no derivative information, we are able to prove some limited
directional second-order results. Although not as strong as classical second-order necessary
conditions, these results are stronger than the first-order conditions that many gradient-based
methods satisfy. Under fairly mild conditions, we can eliminate from consideration all strict local
maximizers and an entire class of saddle points.

Key words. nonlinear programming, pattern search algorithm, derivative-free optimization,
convergence  analysis,  second  order  optimality  conditions 


1. Introduction. In this paper, we consider the class of generalized pattern
search (GPS) algorithms applied to the linearly constrained optimization problem,

$$\min_{x \in X} f(x), \qquad (1.1)$$

where the function $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$, and $X \subseteq \mathbb{R}^n$
is defined by a finite set of linear inequalities; i.e., $X = \{x \in \mathbb{R}^n : Ax \ge b\}$,
where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$.
We treat the unconstrained, bound constrained, and linearly constrained problems
together because in these cases, we apply the algorithm, not to $f$, but to the "barrier"
objective function $f_X = f + \psi_X$, where $\psi_X$ is the indicator function for $X$; i.e., it is
zero on $X$, and infinity elsewhere. If a point $x$ is not in $X$, then we set $f_X(x) = \infty$,
and $f$ is not evaluated. This is important in many practical engineering problems in
which $f$ is expensive to evaluate.
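
The barrier construction can be illustrated with a minimal Python sketch. The helper name `make_barrier`, the toy objective `expensive_f`, and the call log `calls` are hypothetical and are introduced here only for illustration; the sketch assumes $A$ is given as a list of rows. The point it shows is that $f$ is never called at an infeasible point.

```python
import math

def make_barrier(f, A, b):
    """Return f_X, which equals f on X = {x : A x >= b} and +inf elsewhere."""
    def f_X(x):
        for row, b_i in zip(A, b):
            # Check one linear inequality a_i^T x >= b_i.
            if sum(a_ij * x_j for a_ij, x_j in zip(row, x)) < b_i:
                return math.inf  # infeasible point: f itself is not evaluated
        return f(x)
    return f_X

calls = []  # records every point at which the (hypothetical) expensive f runs

def expensive_f(x):
    calls.append(tuple(x))
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

# X = {x in R^2 : x_1 >= 0, x_2 >= 0}
f_X = make_barrier(expensive_f, A=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])
feasible_value = f_X([1.0, 0.0])     # evaluates f
infeasible_value = f_X([1.0, -1.0])  # returns inf without evaluating f
```

After both calls, `calls` contains only the feasible point, which is the saving the barrier approach is meant to provide.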

The  class  of  derivative-free  pattern  search  algorithms  was  originally  defined  and 
analyzed  by  Torczon  [27]  for  unconstrained  optimization  problems  with  a  continuously 
differentiable objective function $f$. Torczon's key result is the proof that there exists
a subsequence of iterates that converges to a point $x^*$ which satisfies the first-order
necessary condition, $\nabla f(x^*) = 0$. Lewis and Torczon [20] add the valuable connection
between  pattern  search  methods  and  positive  basis  theory  [16]  (the  details  of  which  are 
ingrained  into  the  description  of  the  algorithm  in  Section  2).  They  extend  the  class  to 
solve  problems  with  bound  constraints  [21]  and  problems  with  a  finite  number  of  linear 
constraints [22], showing that if $f$ is continuously differentiable, then a subsequence
of  iterates  converges  to  a  point  satisfying  the  Karush-Kuhn-Tucker  (KKT)  first-order 
necessary  conditions  for  optimality. 

Audet  and  Dennis  [7]  add  a  hierarchy  of  convergence  results  for  unconstrained 
and  linearly  constrained  problems  whose  strength  depends  on  the  local  smoothness  of 
the  objective  function.  They  apply  principles  of  the  Clarke  [12]  nonsmooth  calculus 
to  show  convergence  to  a  point  having  nonnegative  generalized  directional  derivatives 
in a set of directions that positively span the tangent cone there. They show convergence
to a first-order stationary (or KKT) point under the weaker hypothesis of strict
differentiability  at  the  limit  point,  and  illustrate  how  results  of  [21],  [22],  and  [27]  are 
corollaries of their own work.

Audet and Dennis also extend GPS to categorical variables [6], which are discrete
variables that cannot be treated by branch and bound techniques. This approach is
successfully applied to engineering design problems in [2] and [19]. The theoretical
results here can certainly be applied to these mixed variable problems, with the caveat
that  results  would  be  with  respect  to  the  continuous  variables  (i.e.,  while  holding 
the  categorical  variables  fixed).  An  adaptation  of  the  results  in  [6]  to  more  general 
objective  functions  using  the  Clarke  [12]  calculus  can  be  found  in  [1]. 

The  purpose  of  this  paper  is  to  provide  insight  into  the  second  order  behavior  of 
the  class  of  GPS  algorithms  for  unconstrained  and  linearly  constrained  optimization. 
This  may  seem  somewhat  counterintuitive,  in  that,  except  for  the  approach  described 
in  [3],  GPS  methods  do  not  even  use  first  derivative  information.  However,  the  nature 
of  GPS  in  evaluating  the  objective  in  multiple  directions  does,  in  fact,  lend  itself  to 
some  limited  discovery  of  second  order  theorems,  which  are  generally  stronger  than 
what  can  be  proved  for  many  gradient-based  methods.  Specifically,  while  we  cannot 
ensure  positive  semi-definiteness  of  the  Hessian  matrix  in  all  directions  (and  in  fact, 
we show a few counter-examples), we can establish this result with respect to a certain
subset  of  the  directions,  so  that  the  likelihood  of  convergence  to  a  point  that  is  not  a 
local  minimizer  is  reasonably  small. 

This paper does not address the question of second order behavior of GPS algorithms
for general nonlinear constraints. Extending convergence results of basic GPS
to problems with nonlinear constraints requires augmentation to handle these constraints.
Lewis and Torczon [23] do this by approximately solving a series of bound
constrained augmented Lagrangian subproblems [14], while Audet and Dennis [9] use
a filter-based approach [17]. The results presented here may be extendable to the former
but not the latter, since the filter approach given in [9] cannot be guaranteed to
converge to a first-order KKT point. The direct search algorithm of Lucidi et al. [24]
applies positive basis theory to handle nonlinear constraints in a way similar to GPS,
but it requires constraint derivatives and satisfaction of a sufficient decrease condition
to ensure convergence, which [23] and [9] do not. Because of dissatisfaction with
these limitations, Audet and Dennis [8] recently introduced the class of mesh-adaptive
direct search (MADS) algorithms, a generalization of GPS that achieves first-order
convergence for nonlinear constrained problems by generating a set of feasible directions
that, in the limit, becomes asymptotically dense in the tangent cone. We plan
to study second-order convergence properties of MADS in future work.

The  remainder  of  this  paper  is  organized  as  follows.  In  the  next  section,  we 
briefly  describe  the  basic  GPS  algorithm,  followed  by  a  review  of  known  convergence 
results  for  basic  GPS  algorithms  in  Section  3.  In  Section  4,  we  show  that,  while 
convergence  to  a  local  maximizer  is  possible,  it  can  only  happen  under  some  very 
strong assumptions on both the objective function and the set of directions used
by  the  algorithm.  In  Section  5,  we  introduce  additional  theorems  to  describe  second 
order  behavior  of  GPS  more  generally,  along  with  a  few  examples  to  illustrate  the 
theory  and  show  that  certain  hypotheses  cannot  be  relaxed.  Section  6  offers  some 
concluding  remarks. 

Notation. $\mathbb{R}$, $\mathbb{Q}$, $\mathbb{Z}$, and $\mathbb{N}$ denote the sets of real
numbers, rational numbers, integers, and nonnegative integers, respectively. For any set $S$,
$|S|$ denotes the cardinality of $S$, and $-S$ is the set defined by $-S = \{-s : s \in S\}$.
For any finite set $A$, we may also refer to the matrix $A$ as the one whose columns are the
elements of $A$. Similarly, for any matrix $A$, the notation $a \in A$ means that $a$ is a
column of $A$. For $x \in X$, the tangent cone to $X$ at $x$ is
$T_X(x) = \mathrm{cl}\{\mu(w - x) : \mu \ge 0,\ w \in X\}$,
and the normal cone $N_X(x)$ to $X$ at $x$ is the polar of the tangent cone; namely,
$N_X(x) = \{v \in \mathbb{R}^n : v^T w \le 0 \ \ \forall\, w \in T_X(x)\}$. It is the nonnegative
span of all outwardly pointing constraint normals at $x$.

2.  Generalized  Pattern  Search  Algorithms.  For  unconstrained  and  linearly 
constrained  optimization  problems,  the  basic  GPS  algorithm  generates  a  sequence  of 
iterates  having  nonincreasing  function  values.  Each  iteration  consists  of  two  main 
steps, an optional SEARCH phase and a local POLL phase, in which the barrier objective
function $f_X$ is evaluated at a finite number of points that lie on a mesh, with the goal
of  finding  a  point  with  lower  objective  function  value,  which  is  called  an  improved 
mesh  point. 

The mesh is not explicitly constructed; rather, it is conceptual. It is defined
primarily through a set of positive spanning directions $D$ in $\mathbb{R}^n$; i.e., where every
vector in $\mathbb{R}^n$ may be represented as a nonnegative linear combination of the elements
of $D$. For convenience, we also view $D$ as a real $n \times n_D$ matrix whose $n_D$ columns
are its elements. The only other restriction on $D$ is that it must be formed as the
product

$$D = GZ, \qquad (2.1)$$

where $G \in \mathbb{R}^{n \times n}$ is a nonsingular real generating matrix, and
$Z \in \mathbb{Z}^{n \times n_D}$ is an integer matrix of full rank. In this way, each direction
$d_j \in D$ may be represented as $d_j = G z_j$, where $z_j \in \mathbb{Z}^n$ is an integer
vector. At iteration $k$, the mesh is defined by the set

$$M_k = \bigcup_{x \in S_k} \{x + \Delta_k D z : z \in \mathbb{N}^{n_D}\}, \qquad (2.2)$$

where $S_k \subset \mathbb{R}^n$ is the set of points where the objective function $f$ had been
evaluated by the start of iteration $k$, and $\Delta_k > 0$ is the mesh size parameter that
controls the fineness of the mesh. This construction is the same as that of [8] and [9], which
generalizes the one given in [7]. It ensures that all previously computed iterates will
lie on the current mesh.
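
As a small illustration of (2.1) and (2.2), the following Python sketch builds $D = GZ$ for the coordinate directions and enumerates a finite window of the mesh around a single point. The function names (`mat_mul`, `mat_vec`, `mesh_points`) and the truncation parameter `z_max` are assumptions made for this sketch; the actual mesh is infinite and is a union over all points of $S_k$, whereas only one base point is used here. Exact rational arithmetic is used so that set membership can be tested reliably.

```python
from fractions import Fraction
from itertools import product

def mat_mul(G, Z):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*Z))
    return [[sum(g * z for g, z in zip(row, col)) for col in cols] for row in G]

def mat_vec(M, v):
    """Matrix-vector product; M is a list of rows."""
    return tuple(sum(m * x for m, x in zip(row, v)) for row in M)

def mesh_points(x, delta, D, z_max):
    """Finite window of {x + delta * D z : z in N^{n_D}}, with 0 <= z_i <= z_max."""
    n_D = len(D[0])
    points = set()
    for z in product(range(z_max + 1), repeat=n_D):
        step = mat_vec(D, z)
        points.add(tuple(xi + delta * si for xi, si in zip(x, step)))
    return points

G = [[1, 0], [0, 1]]                 # nonsingular generating matrix
Z = [[1, 0, -1, 0], [0, 1, 0, -1]]   # integer matrix of full rank
D = mat_mul(G, Z)                    # columns e1, e2, -e1, -e2

x0 = (Fraction(0), Fraction(0))
coarse = mesh_points(x0, Fraction(1), D, z_max=2)
fine = mesh_points(x0, Fraction(1, 2), D, z_max=4)
```

In this example every point of the coarse window also lies in the refined window, which mirrors the statement that previously computed iterates remain on the current mesh when the mesh size is reduced by a rational factor.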

The  SEARCH  step  is  simply  an  evaluation  of  a  finite  number  of  mesh  points.  It 
retains  complete  flexibility  in  choosing  the  mesh  points,  with  the  only  caveat  being 
that  the  points  must  be  finite  in  number  (including  none).  This  could  include  a  few 
iterations  using  a  heuristic,  such  as  a  genetic  algorithm,  random  sampling,  etc.,  or,  as 
is  popular  among  many  in  industry  (see  [5,  10,  11,  25]),  the  approximate  optimization 
on  the  mesh  of  a  less  expensive  surrogate  function.  A  related  algorithm  that  does  not 
require  the  surrogate  solution  to  lie  on  the  mesh  (but  requires  additional  assumptions 
for  convergence)  is  found  in  [15]. 

If the SEARCH step fails to generate an improved mesh point, the POLL step is
performed. This step is much more rigid in its construction, but this is necessary
in order to prove convergence. The POLL step consists of evaluating $f_X$ at points
neighboring the current iterate $x_k$ on the mesh. This set of points $P_k$ is called the
poll set and is defined by

$$P_k = \{x_k + \Delta_k d : d \in D_k \subseteq D\} \subset M_k, \qquad (2.3)$$

where $D_k$ is a positive spanning set of directions taken from $D$. We write $D_k \subseteq D$ to
mean that the columns of $D_k$ are taken from the columns of $D$. Choosing a subset
$D_k \subseteq D$ of positive spanning directions at each iteration also adds the flexibility that
will allow us to handle linear constraints in an efficient fashion.

If either the SEARCH or POLL step is successful in finding an improved mesh point,
then the iteration ends immediately, with that point becoming the new iterate $x_{k+1}$.
In this case, the mesh size parameter is either retained or increased (i.e., the mesh is
coarsened). If neither step finds an improved mesh point, then the point $x_k$ is said to
be a mesh local optimizer and is retained as the new iterate $x_{k+1} = x_k$, and the mesh
size parameter is reduced (i.e., the mesh is refined).

The rules that govern mesh coarsening and refining are as follows. For a fixed
rational number $\tau > 1$ and two fixed integers $w^- \le -1$ and $w^+ \ge 0$, the mesh size is
updated according to the rule,

$$\Delta_{k+1} = \tau^{w_k} \Delta_k, \qquad (2.4)$$

where $w_k \in \{0, 1, \ldots, w^+\}$ if the mesh is coarsened, or
$w_k \in \{w^-, w^- + 1, \ldots, -1\}$ if the mesh is refined.

From (2.4), it follows that, for any $k \ge 0$, there exists an integer $r_k$ such that

$$\Delta_{k+1} = \tau^{r_k} \Delta_0. \qquad (2.5)$$
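
The relationship between (2.4) and (2.5) can be checked numerically: the integer $r_k$ is simply the running sum of the exponents $w_0, \ldots, w_k$. The following Python sketch uses illustrative values ($\tau = 2$, $\Delta_0 = 1$, and an arbitrary exponent sequence with $w^- = -2$ and $w^+ = 1$) that are assumptions of this example, not values prescribed by the text.

```python
from fractions import Fraction

def update_mesh_size(delta, tau, w):
    """One application of rule (2.4): delta_{k+1} = tau**w_k * delta_k."""
    return delta * tau ** w

tau = Fraction(2)        # fixed rational number tau > 1
delta0 = Fraction(1)     # initial mesh size
exponents = [-1, -1, 1, 0, -2]  # negative: refine; nonnegative: retain or coarsen

delta = delta0
r = 0                    # running integer exponent r_k of (2.5)
for w in exponents:
    delta = update_mesh_size(delta, tau, w)
    r += w
```

At the end of the loop, `delta` equals `delta0 * tau ** r`, so every mesh size is an integer power of $\tau$ times $\Delta_0$.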

The  basic  GPS  algorithm  is  given  in  Figure  2.1. 


Generalized Pattern Search (GPS) Algorithm

Initialization: Let $S_0$ be a set of initial points, and let $x_0 \in S_0$ satisfy
    $f_X(x_0) < \infty$ and $f_X(x_0) \le f_X(y)$ for all $y \in S_0$. Let $\Delta_0 > 0$, and
    let $D$ be a finite set of $n_D$ positive spanning directions. Define $M_0 \subset X$
    according to (2.2).

For $k = 0, 1, 2, \ldots$, perform the following:

1. SEARCH step: Optionally employ some finite strategy seeking an improved
   mesh point; i.e., $x_{k+1} \in M_k$ satisfying $f_X(x_{k+1}) < f_X(x_k)$.

2. POLL step: If the SEARCH step was unsuccessful or not performed, evaluate
   $f_X$ at points in the poll set $P_k$ (see (2.3)) until an improved mesh point
   $x_{k+1}$ is found, or until all points in $P_k$ have been evaluated.

3. Update: If SEARCH or POLL finds an improved mesh point,
   update $x_{k+1}$, and set $\Delta_{k+1} \ge \Delta_k$ according to (2.4);
   otherwise, set $x_{k+1} = x_k$, and set $\Delta_{k+1} < \Delta_k$ according to (2.4).

Fig.  2.1.  Basic  GPS  Algorithm 
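
The steps of Figure 2.1 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a reference implementation: the SEARCH step is empty, the same direction set is polled at every iteration ($D_k = D$), the mesh is retained on success ($w^+ = 0$) and halved on failure ($\tau = 2$, $w^- = -1$), and a stopping tolerance `delta_tol` is added for practicality even though the algorithm in the figure runs indefinitely. The function name `gps` and all parameter names are hypothetical.

```python
import math

def gps(f_X, x0, directions, delta0=1.0, tau=2.0, w_minus=-1, w_plus=0,
        delta_tol=1e-8, max_iter=10_000):
    """Poll-only GPS sketch: empty SEARCH step and a fixed direction set."""
    x = tuple(x0)
    fx = f_X(x)
    assert fx < math.inf, "the initial point must be feasible"
    delta = delta0
    for _ in range(max_iter):
        if delta < delta_tol:
            break
        improved = False
        for d in directions:  # POLL step: stop at the first improved mesh point
            y = tuple(xi + delta * di for xi, di in zip(x, d))
            fy = f_X(y)
            if fy < fx:
                x, fx, improved = y, fy, True
                break
        # Update step, rule (2.4): retain/coarsen on success, refine on failure.
        delta *= tau ** (w_plus if improved else w_minus)
    return x, fx, delta

def quadratic(p):
    return (p[0] - 1.0) ** 2 + (p[1] + 0.5) ** 2

coordinate_dirs = [(1, 0), (0, 1), (-1, 0), (0, -1)]
x_best, f_best, delta_final = gps(quadratic, (0.0, 0.0), coordinate_dirs)
```

For this smooth convex test function the sketch reaches the minimizer $(1, -0.5)$ after two successful polls and then only refines the mesh. To handle linear constraints one would pass a barrier function in place of `quadratic` and choose directions that conform to the constraints, as discussed next.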

With  the  addition  of  linear  constraints,  in  order  to  retain  first-order  convergence 
properties, the set of directions $D_k$ must be chosen to conform to the geometry of the
constraints.  The  following  definition,  found  in  [7]  (as  an  abstraction  of  the  ideas  and 
approach  of  [22]),  gives  a  precise  description  for  what  is  needed  for  convergence. 

Definition 2.1. A rule for selecting the positive spanning sets $D_k = D(k, x_k) \subseteq D$
conforms to $X$ for some $\epsilon > 0$, if at each iteration $k$ and for every boundary point
$y \in X$ satisfying $\|y - x_k\| < \epsilon$, the tangent cone $T_X(y)$ is generated by
nonnegative linear combinations of the columns of $D_k$.

Using standard linear algebra tools, Lewis and Torczon [22] provide a clever algorithm
to actually construct the sets $D_k$. If these sets are chosen so that they conform
to $X$, all iterates lie in a compact set, and $f$ is sufficiently smooth, then a subsequence
of GPS iterates converges to a first-order stationary point [7, 22].

3. Existing Convergence Results. Before presenting new results, it is important
to state what is currently known about the convergence properties of GPS for
linearly constrained problems.

We  first  make  the  following  assumptions: 


A1: All iterates $\{x_k\}$ produced by the GPS algorithm lie in a compact set.

A2: The set of directions $D = GZ$, as defined in (2.1), includes tangent cone
generators for every point in $X$.

A3: The rule for selecting positive spanning sets $D_k$ conforms to $X$ for some $\epsilon > 0$.


Assumption A1, which is already sufficient to guarantee the existence of convergent
subsequences of the iteration sequence, is a standard assumption [6, 7, 9,
14, 15, 17, 21, 22, 27]. A sufficient condition for this to hold is that the level set
$L(x_0) = \{x \in X : f(x) \le f(x_0)\}$ is compact. We can assume that $L(x_0)$ is bounded,
but not closed, since we allow $f$ to be discontinuous and extended valued. Thus we
can assume that the closure of $L(x_0)$ is compact. We should also note that most real
engineering optimization problems have simple bounds on the design variables, which
is enough to ensure that Assumption A1 is satisfied, since iterates lying outside of
$X$ are not evaluated by GPS. In the unconstrained case, note that Assumptions A2
and A3 are automatically satisfied by any positive spanning set constructed from the
product in (2.1).

Assumption  A2  is  automatically  satisfied  if  G  =  I  and  the  constraint  matrix  A 
is  rational,  as  is  the  case  in  [22].  Note  that  the  finite  number  of  linear  constraints 
ensures  that  the  set  of  tangent  cone  generators  for  all  points  in  X  is  finite,  which 
ensures  that  the  finiteness  of  D  is  not  violated. 

If $f$ is lower semi-continuous at any GPS limit point $x$, then $f(x) \le \lim_k f(x_k)$,
with equality if $f$ is continuous [7]. Of particular interest are limit points of certain
subsequences (indexed by some index set $K$) for which $\lim_{k \in K} \Delta_k = 0$. We know that
at least one such subsequence exists because of Torczon's [27] key result, restated here
for convenience.

Theorem 3.1. The mesh size parameters satisfy $\liminf_{k \to +\infty} \Delta_k = 0$.

From this result, we are interested in subsequences of iterates that converge to a
limit point associated with $\Delta_k$ converging to zero. The following definitions are due
to Audet and Dennis [7].

Definition 3.2. A subsequence of GPS mesh local optimizers $\{x_k\}_{k \in K}$ (for
some subset of indices $K$) is said to be a refining subsequence if $\{\Delta_k\}_{k \in K}$ converges
to zero.

Definition 3.3. Let $x$ be a limit point of a refining subsequence $\{x_k\}_{k \in K}$. A
direction $d \in D$ is said to be a refining direction of $x$ if $x_k + \Delta_k d \in X$ and
$f(x_k) \le f(x_k + \Delta_k d)$ for infinitely many $k \in K$.

Audet and Dennis [6] prove the existence of at least one convergent refining subsequence.
An important point is that, since a refining direction $d$ is one in which
$x_k + \Delta_k d \in X$ infinitely often in the subsequence, it must be a feasible direction at
$x$, and thus lies in the tangent cone $T_X(x)$.

The  key  results  of  Audet  and  Dennis  are  now  given.  The  first  shows  directional 
optimality  conditions  under  the  assumption  of  Lipschitz  continuity,  and  is  obtained  by 
a  very  short  and  elegant  proof  (see  [7])  using  Clarke’s  [12]  definition  of  the  generalized 
directional derivative. Audet [4] provides an example to show that Lipschitz continuity
(and even differentiability) is not sufficient to ensure convergence to a Clarke
stationary point (i.e., where zero belongs to the Clarke generalized gradient). The
second  result,  along  with  its  corollary  for  unconstrained  problems,  shows  convergence 
to  a  point  satisfying  first-order  necessary  conditions  for  optimality.  The  latter  two 
results  were  originally  proved  by  Torczon  [27]  and  Lewis  and  Torczon  [21,  22]  under 
the assumption of continuous differentiability of $f$ on the level set containing all of
the  iterates.  Audet  and  Dennis  [7]  prove  the  same  results,  stated  here,  requiring  only 
strict  differentiability  at  the  limit  point. 

Theorem 3.4. Let $x$ be a limit of a refining subsequence, and let $d \in D$ be any
refining direction of $x$. Under Assumptions A1-A3, if $f$ is Lipschitz continuous near
$x$, then the generalized directional derivative of $f$ at $x$ in the direction $d$ is nonnegative,
i.e., $f^\circ(x; d) \ge 0$.

Theorem 3.5. Under Assumptions A1-A3, if $f$ is strictly differentiable at a
limit point $x$ of a refining subsequence, then $\nabla f(x)^T w \ge 0$ for all $w \in T_X(x)$, and
$-\nabla f(x) \in N_X(x)$. Thus, $x$ satisfies the KKT first-order necessary conditions for
optimality.

Corollary 3.6. Under Assumption A1, if $f$ is strictly differentiable at a limit
point $x$ of a refining subsequence, and if $X = \mathbb{R}^n$ or $x \in \mathrm{int}(X)$, then
$\nabla f(x) = 0$.

Although  GPS  is  a  derivative-free  method,  its  strong  dependence  on  the  set  of 
mesh directions presents some advantages in terms of convergence results. For example,
if $f$ is only Lipschitz continuous at certain limit points $x^*$, Theorem 3.4 provides a
measure of directional optimality there in terms of the Clarke generalized directional
derivatives being nonnegative [7]. In the next two sections, we attempt to prove certain
second order optimality conditions, given sufficient smoothness of the objective
function $f$. Our goal is to quantify our belief that convergence of GPS to a point that
is  not  a  local  minimizer  is  very  rare. 

4. GPS and Local Maximizers. We treat the possibility of convergence to
a local maximizer separately from other stationary points because what we can prove
requires far less stringent assumptions. We begin with an example, provided by
Charles Audet, to show that it is indeed possible to converge to a maximizer, even
when $f$ is smooth.

Example 4.1. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be the continuously differentiable function
defined by

$$f(x, y) = -x^2 y^2.$$

Choose $(x_0, y_0) = (0, 0)$ as the initial point, and set $D = [e_1, e_2, -e_1, -e_2]$, where $e_1$
and $e_2$ are the standard coordinate directions. Now observe that, if the SEARCH phase
is empty, then the iteration sequence begins at the global maximizer $(0, 0)$, but can
never move off of that point because the directions in $D$ are lines of constant function
value. Thus the sequence $x_k$ converges to the global maximizer $(0, 0)$.

Example  4.1  is  clearly  pathological.  Had  we  started  at  any  other  point  or  polled 
in  any  other  direction  (that  is  not  a  scalar  multiple  of  a  coordinate  direction),  the 
algorithm  would  not  have  stalled  at  the  maximizer  (0,0).  However,  it  is  clear  that 
the  method  of  steepest  descent  and  even  Newton’s  method  would  also  fail  to  move 
away  from  this  point. 
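
The stalling behavior of Example 4.1, and the remark that polling in any non-coordinate direction would escape, can be reproduced with a short Python sketch. The helper `poll_once` and the diagonal direction set are assumptions introduced for illustration; the diagonal set is one possible positive spanning set that contains no multiple of a coordinate direction.

```python
def poll_once(f, x, delta, directions):
    """Return the first poll point with strictly lower value, or None."""
    fx = f(x)
    for d in directions:
        y = tuple(xi + delta * di for xi, di in zip(x, d))
        if f(y) < fx:
            return y
    return None

def f(p):
    x, y = p
    return -(x ** 2) * (y ** 2)

coordinate_dirs = [(1, 0), (0, 1), (-1, 0), (0, -1)]
diagonal_dirs = [(1, 1), (-1, 1), (-1, -1), (1, -1)]

# With coordinate directions, polling fails at the origin for every mesh size tried.
stalled = all(
    poll_once(f, (0.0, 0.0), 2.0 ** -k, coordinate_dirs) is None
    for k in range(30)
)

# With diagonal directions, the very first poll point improves on f(0, 0) = 0.
escaped = poll_once(f, (0.0, 0.0), 1.0, diagonal_dirs)
```

Here `stalled` is true because $f$ vanishes along both coordinate axes, while `escaped` is the point $(1, 1)$, where $f = -1$.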

From  this  example,  one  can  envision  other  cases  (also  pathological),  in  which 
convergence  to  a  local  maximizer  is  possible,  but  without  starting  there.  However,  we 
can  actually  characterize  these  rare  situations,  in  which  convergence  to  a  maximizer 
can  occur.  Lemma  4.2  shows  that  convergence  would  be  achieved  after  only  a  finite 
number of iterations. Under a slightly stronger assumption, Theorem 4.3 ensures
that convergence to a maximizer means that every refining direction is a direction
of constant function value. This very restrictive condition is consistent with Example 4.1.
Corollary 4.4 establishes the key result that, under appropriate conditions,
convergence  to  a  strict  local  maximizer  cannot  occur.  This  result  does  not  hold  for 
gradient-based  methods,  even  when  applied  to  smooth  functions. 

Lemma 4.2. Let $x$ be the limit of a refining subsequence. If $f$ is lower
semi-continuous at $x$, and if $x$ is a local maximizer of $f$ in $X$, then $x_k = x$ is achieved in
a finite number of iterations.

Proof. Since $x = \lim_{k \in K} x_k$ is a local maximizer for $f$ in $X$, there exists an open
ball $B(x, \epsilon)$ of radius $\epsilon$, centered at $x$, for some $\epsilon > 0$, such that
$f(x) \ge f(y)$ for all $y \in B(x, \epsilon) \cap X$. Then for all sufficiently large $k \in K$,
$x_k \in B(x, \epsilon) \cap X$, and thus $f(x) \ge f(x_k)$. But since GPS generates a nonincreasing
sequence of function values, and since $f$ is lower semi-continuous at $x$, it follows that

$$f(x_k) \le f(x) \le f(x_{k+1}) \le f(x_k), \qquad (4.1)$$

and thus $f(x_k) = f(x)$, for all sufficiently large $k \in K$. But since GPS iterates satisfy
$x_{k+1} \ne x_k$ only when $f(x_{k+1}) < f(x_k)$, it follows that $x_k = x$ for all sufficiently large
$k$. ■

Theorem 4.3. Let $x$ be the limit of a refining subsequence. If $f$ is lower
semi-continuous in a neighborhood of $x$, and if $x$ is a local maximizer of $f$ in $X$, then every
refining direction is a direction of constant function value.

Proof. Let $d \in D(x)$ be a refining direction. Since $x$ is a local maximizer, there
exists $\hat{\delta} > 0$ such that $x + td \in X$ and $f(x) \ge f(x + td)$ for all
$t \in (0, \hat{\delta})$. Now suppose that there exists $\delta \in (0, \hat{\delta})$ such that $f$
is continuous in $B(x, \delta)$ and $f(x) > f(x + td)$ for all $t \in (0, \delta)$. Then
$f(x) > f(x + \Delta_k d)$ for $\Delta_k \in (0, \delta)$. But since Lemma 4.2
ensures convergence of GPS in a finite number of steps, we have the contradiction,
$f(x) = f(x_k) \le f(x_k + \Delta_k d) = f(x + \Delta_k d)$ for all sufficiently large $k$. Therefore
there must exist $\delta > 0$ such that $f(x) = f(x + td)$ for all $t \in (0, \delta)$. ■

Corollary 4.4. The GPS algorithm cannot converge to any strict local maximizer
of $f$ at which $f$ is lower semi-continuous.

Proof. If $x$ is a strict local maximizer of $f$ in $X$, then the first inequality of (4.1)
is strict, yielding the contradiction, $f(x_k) < f(x) \le f(x_k)$. ■

The assumption that $f$ is lower semi-continuous at $x$ is necessary for all three of
these results to hold. As an example, consider the function $f(x) = 1$ if $x = 0$, and
$x^2$ otherwise. This function has a strict local maximum at 0, and there are clearly
no directions of constant function value. It is easy to see that any sequence of GPS
iterates will converge to zero, and by choosing an appropriate starting point and mesh
size, we can prevent convergence in a finite number of iterations. The theory is not
violated because $f$ is not lower semi-continuous there.
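
A minimal one-dimensional Python sketch makes this counterexample concrete. The choices $x_0 = 1$, $\Delta_0 = 1$, $D = \{+1, -1\}$, refinement by halving, and no coarsening are assumptions of the sketch; with them, the iterates are $1, 1/2, 1/4, \ldots$, which approach 0 without ever reaching it, because the poll point 0 always has the larger value $f(0) = 1$.

```python
def f(x):
    # Strict local maximum at 0; not lower semi-continuous there.
    return 1.0 if x == 0.0 else x * x

x, delta = 1.0, 1.0
iterates = [x]
for _ in range(60):
    fx = f(x)
    for d in (1.0, -1.0):      # poll in the directions +1 and -1
        y = x + delta * d
        if f(y) < fx:          # improved mesh point: move, keep the mesh size
            x = y
            iterates.append(x)
            break
    else:
        delta *= 0.5           # mesh local optimizer: refine the mesh
```

Successful and unsuccessful iterations alternate, so after 60 iterations the current iterate is $2^{-30}$: the sequence converges to the strict local maximizer 0, but not in finitely many steps.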

The additional assumption in Theorem 4.3 of lower semi-continuity in a neighborhood
of the limit point (not just at the limit point) is needed to avoid other
pathological examples, such as the function $f(x) = 0$ if $x \in \mathbb{Q}$ and $-x^2$ if
$x \notin \mathbb{Q}$. Continuity of $f$ only holds at the local maximizer 0, and there are no
directions of constant function value. A typical instance of GPS that uses rational arithmetic
would stall at the starting point of 0.

5. Second Order Theorems. An interesting observation about Example 4.1 is
that, even though $(0, 0)$ is a local (and global) maximizer, the Hessian matrix is equal
to the zero matrix there, meaning that it is actually positive semidefinite. This may
seem counterintuitive, but it is simply a case where the curvature of the function is
described by Taylor series terms of higher than second order.

Thus  an  important  question  not  yet  answered  is  whether  GPS  can  converge  to 
a  stationary  point  at  which  the  Hessian  is  not  positive  semidefinite  (given  that  the 
objective  is  twice  continuously  differentiable  near  the  stationary  point).  The  following 
simple  example  demonstrates  that  it  is  indeed  possible,  but  once  again,  the  algorithm 
does  not  move  off  of  the  starting  point. 

Example 5.1. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be the continuously differentiable function
defined by

$$f(x, y) = xy.$$

Choose $(x_0, y_0) = (0, 0)$ as the initial point, and set $D = [e_1, e_2, -e_1, -e_2]$, where $e_1$
and $e_2$ are the standard coordinate directions. Now observe that, if the SEARCH step
is empty, then the iteration sequence begins at the saddle point $(0, 0)$, but can never
move off of that point because the directions in $D$ are lines of constant function value.
Thus the sequence $x_k$ converges to the saddle point. Furthermore, the Hessian of $f$ at
$(0, 0)$ is given by

$$\nabla^2 f(0, 0) = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},$$

which is indefinite, having eigenvalues of $\pm 1$.
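
The numbers in Example 5.1 can be verified with a few lines of Python. The helper names `quad_form` and `eig_sym_2x2` are assumptions of this sketch, and the eigenvalues are computed with the closed-form expression for a symmetric $2 \times 2$ matrix. The sketch also evaluates the quadratic form $d^T \nabla^2 f(0,0)\, d$ along the polled coordinate directions and along the unpolled direction $(1, -1)$.

```python
import math

def quad_form(H, d):
    """Quadratic form d^T H d for a matrix H stored as a list of rows."""
    n = len(d)
    return sum(d[i] * H[i][j] * d[j] for i in range(n) for j in range(n))

def eig_sym_2x2(H):
    """Eigenvalues (smallest first) of a symmetric 2x2 matrix."""
    a, b, c = H[0][0], H[0][1], H[1][1]
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius

hessian = [[0.0, 1.0], [1.0, 0.0]]   # Hessian of f(x, y) = x*y at (0, 0)

eigenvalues = eig_sym_2x2(hessian)
coordinate_curvatures = [
    quad_form(hessian, d) for d in [(1, 0), (0, 1), (-1, 0), (0, -1)]
]
diagonal_curvature = quad_form(hessian, (1, -1))
```

The curvature is zero along every coordinate direction, so polling along $D$ cannot detect the negative curvature ($-2$ for the direction $(1, -1)$) that makes the Hessian indefinite.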

This  result  is  actually  not  surprising,  since  many  gradient-based  methods  have 
this  same  limitation.  However,  the  results  that  follow  provide  conditions  by  which  a 
pseudo-second order necessary condition is satisfied - one that is weaker than the traditional
second order necessary condition, but stronger than the first-order condition
that  is  all  that  can  be  guaranteed  by  most  gradient-based  methods. 

We are now ready to present one of the main results of this paper. This will require
the use of the Clarke [12] calculus in a manner similar to that of [7], but applied to $f'$
instead of $f$ itself. We will denote by $f^{\circ\circ}(x; d_1, d_2)$ the Clarke generalized
directional derivative in the direction $d_2$ of the directional derivative $f'(x; d_1)$ of $f$ at
$x$ in the fixed direction $d_1$. In other words, if $g(x) = f'(x; d_1)$, then
$f^{\circ\circ}(x; d_1, d_2) = g^\circ(x; d_2)$.
We should note that this is consistent with the concepts and notation given in [13]
and [18]; however, we have endeavored to simplify the discussion for clarity. First, we
give a general lemma that is independent of the GPS algorithm. The theorem and
corollary that follow will be key to establishing a pseudo-second order result for GPS.

Lemma 5.2. Let $f : \mathbb{R}^n \to \mathbb{R}$ be continuously differentiable at $x$, and let
$f'(\cdot; \pm d)$ be Lipschitz near $x$. Then

$$f^{\circ\circ}(x; d, d) = \limsup_{y \to x,\ t \downarrow 0}
  \frac{f(y + td) - 2f(y) + f(y - td)}{t^2}.$$

Proof. In general, we can apply the definition of the generalized directional derivative
and the backward difference formula for directional derivatives to obtain

$$\begin{aligned}
f^{\circ\circ}(x; d, d) = g^\circ(x; d)
  &= \limsup_{y \to x,\ t \downarrow 0} \frac{g(y + td) - g(y)}{t} \\
  &= \limsup_{y \to x,\ t \downarrow 0} \frac{f'(y + td; d) - f'(y; d)}{t} \\
  &= \limsup_{y \to x,\ t \downarrow 0} \frac{1}{t} \left[
       \lim_{s \to 0} \frac{f(y + td) - f(y + td - sd)}{s}
       - \lim_{s \to 0} \frac{f(y) - f(y - sd)}{s} \right] \\
  &= \limsup_{y \to x,\ t \downarrow 0} \left[
       \lim_{s \to 0} \frac{f(y + td) - f(y + (t - s)d) - f(y) + f(y - sd)}{ts} \right] \\
  &= \limsup_{y \to x,\ t \downarrow 0} \left[
       \frac{f(y + td) - 2f(y) + f(y - td)}{t^2} \right],
\end{aligned}$$

where the last equation follows from letting $s$ approach zero as $t$ does (which is
allowable, since the limit as $s \to 0$ exists and is independent of how it is approached). ■

Theorem 5.3. Let $x$ be the limit of a refining subsequence, and let $D(x)$ be the set of refining directions for $x$. Under Assumptions A1-A3, if $f$ is continuously differentiable in a neighborhood of $x$, then for every direction $d \in D(x)$ such that $\pm d \in D(x)$ and $f'(\cdot\,; \pm d)$ is Lipschitz near $x$, $f^{\circ\circ}(x; d, d) \ge 0$.

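Proof. From Lemma 5.2, it follows that

$$\begin{aligned}
f^{\circ\circ}(x; d, d)
&= \limsup_{y \to x,\, t \downarrow 0} \frac{f(y + td) - 2f(y) + f(y - td)}{t^2} \\
&\ge \lim_{k \in K} \frac{f(x_k + \Delta_k d) - 2f(x_k) + f(x_k - \Delta_k d)}{\Delta_k^2} \\
&\ge 0,
\end{aligned}$$

since $\pm d \in D(x)$ means that $f(x_k) \le f(x_k \pm \Delta_k d)$ for all $k \in K$. ■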

Corollary 5.4. Let $x$ be the limit of a refining subsequence, and let $D(x)$ be the set of refining directions for $x$. Under Assumptions A1-A3, if $f$ is twice continuously differentiable at $x$, then $d^T \nabla^2 f(x) d \ge 0$ for every direction $d$ satisfying $\pm d \in D(x)$.

Proof. This follows directly from Theorem 5.3 and the fact that, when $\nabla^2 f(x)$ exists, $d^T \nabla^2 f(x) d = f^{\circ\circ}(x; d, d)$. ■
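The identity used in this proof can be made concrete with a short Taylor argument. The sketch below is an illustration rather than part of the original proof, and it assumes slightly more than the corollary states, namely that $\nabla^2 f$ exists and is continuous in a neighborhood of $x$, so that the remainder terms are uniform in $y$.

```latex
% Two second-order Taylor expansions about y, in the directions +d and -d:
\begin{align*}
f(y + td) &= f(y) + t\,\nabla f(y)^T d + \tfrac{1}{2} t^2\, d^T \nabla^2 f(y)\, d + o(t^2), \\
f(y - td) &= f(y) - t\,\nabla f(y)^T d + \tfrac{1}{2} t^2\, d^T \nabla^2 f(y)\, d + o(t^2).
\end{align*}
% Adding the two expansions cancels the first-order terms:
\begin{equation*}
\frac{f(y + td) - 2 f(y) + f(y - td)}{t^2} = d^T \nabla^2 f(y)\, d + o(1).
\end{equation*}
% Letting y -> x and t -> 0 and using continuity of the Hessian:
\begin{equation*}
f^{\circ\circ}(x; d, d)
  = \limsup_{y \to x,\ t \downarrow 0} \frac{f(y + td) - 2 f(y) + f(y - td)}{t^2}
  = d^T \nabla^2 f(x)\, d .
\end{equation*}
```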

The  following  example  illustrates  how  a  function  can  satisfy  the  hypotheses  of 
Theorem  5.3,  but  not  those  of  Corollary  5.4. 

Example 5.5. Consider the strictly convex function $f : \mathbb{R} \to \mathbb{R}$ defined by

$$f(x) = \begin{cases} x^2, & \text{if } x \ge 0, \\ -x^3, & \text{if } x < 0. \end{cases}$$

GPS will converge to the global minimizer at $x = 0$ from any starting point. The derivative of $f$ is given by

$$f'(x) = \begin{cases} 2x, & \text{if } x \ge 0, \\ -3x^2, & \text{if } x < 0. \end{cases}$$

Clearly, $f'$ is (Lipschitz) continuous at all $x \in \mathbb{R}$, satisfying the hypotheses of Theorem 5.3. The second derivative of $f$ is given by

$$f''(x) = \begin{cases} 2, & \text{if } x > 0, \\ -6x, & \text{if } x < 0, \end{cases}$$

and it does not exist at $x = 0$. Thus, the hypotheses of Corollary 5.4 are violated. The conclusion of Theorem 5.3 can be verified by examining the Clarke derivatives of $f$ at $x = 0$:


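$$\begin{aligned}
f^{\circ\circ}(0; d, d)
&= \limsup_{y \to 0,\, t \downarrow 0} \frac{f(y + td) - 2f(y) + f(y - td)}{t^2} \\
&\ge \limsup_{y \to 0,\, t \downarrow 0} \frac{(y + td)^2 - 2y^2 + (y - td)^2}{t^2} \\
&= \limsup_{y \to 0,\, t \downarrow 0} \frac{y^2 + 2ytd + t^2 d^2 - 2y^2 + y^2 - 2ytd + t^2 d^2}{t^2} \\
&= 2d^2 \ge 0.
\end{aligned}$$

Here the inequality comes from restricting the limit superior to points with $y \ge t|d|$, where all three arguments are nonnegative and $f$ coincides with $x^2$.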
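The value $2d^2$ can also be observed numerically. The following is a minimal sketch, not part of the original analysis: it samples the second-difference quotient of Lemma 5.2 on a grid of base points and step sizes around $0$ and reports the largest value found. The helper names and the grid resolution are assumptions made for illustration, and a finite grid can only approximate a limit superior.

```python
def f(x):
    """Piecewise function of Example 5.5: x^2 for x >= 0, -x^3 for x < 0."""
    return x * x if x >= 0 else -x ** 3


def second_difference(func, y, t, d):
    """Second-difference quotient (f(y + td) - 2 f(y) + f(y - td)) / t^2."""
    return (func(y + t * d) - 2.0 * func(y) + func(y - t * d)) / (t * t)


def sampled_sup(func, d, radius, samples=200):
    """Largest quotient over a grid with |y| <= radius and 0 < t <= radius."""
    best = float("-inf")
    for i in range(-samples, samples + 1):
        y = radius * i / samples
        for j in range(1, samples + 1):
            t = radius * j / samples
            best = max(best, second_difference(func, y, t, d))
    return best


if __name__ == "__main__":
    d = 1.0
    # Shrinking the sampling radius mimics y -> 0 and t -> 0.
    for radius in (1e-1, 1e-2, 1e-3):
        print(radius, sampled_sup(f, d, radius))
```

For base points with $y \ge t|d|$ the quotient equals $2d^2$ exactly, while for base points entirely in the cubic branch it equals $-6y\,d^2$, which tends to zero. This is why the limit superior, rather than an ordinary limit, is the natural object here.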


5.1. Results for Unconstrained Problems. For unconstrained problems, recall that if $f$ is twice continuously differentiable at a stationary point $x^*$, the second order necessary condition for optimality is that $\nabla^2 f(x^*)$ is positive semi-definite; that is, $v^T \nabla^2 f(x^*) v \ge 0$ for all $v \in \mathbb{R}^n$. The following definition gives a pseudo-second order necessary condition that is not as strong as the traditional one.

Definition 5.6. Suppose that $x^*$ is a stationary point of a function $f : \mathbb{R}^n \to \mathbb{R}$ that is twice continuously differentiable at $x^*$. Then $f$ is said to satisfy a pseudo-second order necessary condition at $x^*$ for an orthonormal basis $V \subset \mathbb{R}^n$ if

$$v^T \nabla^2 f(x^*) v \ge 0 \quad \forall\, v \in V. \tag{5.1}$$


We note that (5.1) holds for $-V$ as well; therefore, satisfying this condition means that it holds for a set of “evenly distributed” vectors in $\mathbb{R}^n$.

Now  recall  that  a  symmetric  matrix  is  positive  semidefinite  if  and  only  if  it  has 
nonnegative  real  eigenvalues.  The  following  theorem  gives  an  analogous  result  for 
matrices  that  are  positive  semidefinite  with  respect  to  only  an  orthonormal  basis.  We 
note  that  this  general  linear  algebra  result  is  independent  of  the  convergence  results 
presented  in  this  paper. 

Theorem 5.7. Let $B \in \mathbb{R}^{n \times n}$ be symmetric, and let $V$ be an orthonormal basis for $\mathbb{R}^n$. If $B$ satisfies $v^T B v \ge 0$ for all $v \in V$, then the sum of its eigenvalues is nonnegative. If $B$ also satisfies $v^T B v > 0$ for at least one $v \in V$, then this sum is positive.

Proof. Since $B$ is symmetric, its Schur decomposition can be expressed as $B = Q \Lambda Q^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ and $Q \in \mathbb{R}^{n \times n}$ is an orthogonal matrix whose columns $q_i$, $i = 1, 2, \ldots, n$, are the orthonormal eigenvectors corresponding to the real eigenvalues $\lambda_i$, $i = 1, 2, \ldots, n$. Then for each $v_i \in V$, $i = 1, 2, \ldots, n$,

$$0 \le v_i^T B v_i = v_i^T Q \Lambda Q^T v_i = \sum_{j=1}^{n} \lambda_j (Q^T v_i)_j^2 = \sum_{j=1}^{n} \lambda_j (q_j^T v_i)^2, \tag{5.2}$$

and, since $\{q_j\}_{j=1}^n$ and $\{v_i\}_{i=1}^n$ are both orthonormal bases for $\mathbb{R}^n$, it follows that

$$0 \le \sum_{i=1}^{n} v_i^T B v_i = \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_j (q_j^T v_i)^2 = \sum_{j=1}^{n} \lambda_j \sum_{i=1}^{n} (v_i^T q_j)^2 = \sum_{j=1}^{n} \lambda_j \|q_j\|_2^2 = \sum_{j=1}^{n} \lambda_j. \tag{5.3}$$

To obtain the final result, observe that making just one of the inequalities in (5.2) strict yields a similar strict inequality in (5.3), and the result is proved. ■

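It is easy to see from this proof that if $V$ happens to be the set of eigenvectors $Q$ of $B$, then $B$ is positive semidefinite (or positive definite, if the inequalities in (5.2) are all strict), since in this case (5.2) yields $q_j^T v_i = q_j^T q_i = \delta_{ij}$, which means that $\lambda_i \ge 0$ (respectively, $\lambda_i > 0$).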
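The key identity in (5.3) is that the curvatures $v_i^T B v_i$ along any orthonormal basis add up to the trace of $B$, which equals the sum of its eigenvalues. The sketch below checks this numerically for a random symmetric matrix and a random orthonormal basis. It is an illustration only; the helper names, the use of modified Gram-Schmidt to build the basis, and the random test data are assumptions, not part of the original text.

```python
import math
import random


def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            proj = sum(a * b for a, b in zip(w, q))
            w = [a - proj * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        basis.append([a / norm for a in w])
    return basis


def quad_form(B, v):
    """Return v^T B v for a square matrix B stored as a list of rows."""
    n = len(v)
    return sum(v[i] * B[i][j] * v[j] for i in range(n) for j in range(n))


def random_symmetric(n, rng):
    """Random symmetric n-by-n matrix with entries in [-1, 1]."""
    A = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    return [[0.5 * (A[i][j] + A[j][i]) for j in range(n)] for i in range(n)]


def curvature_sum(B, V):
    """Sum of v^T B v over the vectors v in V."""
    return sum(quad_form(B, v) for v in V)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 4
    B = random_symmetric(n, rng)
    V = gram_schmidt([[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)])
    trace = sum(B[i][i] for i in range(n))
    # The two printed numbers should agree up to rounding error.
    print(curvature_sum(B, V), trace)
```

Because the sum depends only on the trace, a nonnegative curvature along every basis vector forces a nonnegative eigenvalue sum, but says nothing about the sign of any individual eigenvalue.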

We now establish pseudo-second order results for GPS by the following two theorems. The first theorem requires convergence in a finite number of steps, while the second necessitates the use of a more specific set of positive spanning directions.

Theorem 5.8. Let $V$ be an orthonormal basis in $\mathbb{R}^n$. Let $x$ be the limit of a refining subsequence, and let $D(x)$ be the set of refining directions for $x$. Under Assumption A1, if $f$ is twice continuously differentiable at $x$, $D_k \supseteq V$ infinitely often in the subsequence, and $x_k = x$ for all sufficiently large $k$, then $f$ satisfies a pseudo-second order necessary condition for $V$ at $x$.

Proof. For all $k \in K$ and $d \in D(x)$, we have $f(x_k + \Delta_k d) \ge f(x_k)$. Furthermore, for all sufficiently large $k \in K$, since $x_k = x$, a simple substitution yields $f(x + \Delta_k d) \ge f(x)$ for all $d \in D(x)$. For each $d \in D(x)$, Taylor’s Theorem yields

$$f(x + \Delta_k d) = f(x) + \Delta_k d^T \nabla f(x) + \tfrac{1}{2} \Delta_k^2 d^T \nabla^2 f(x) d + O(\Delta_k^3).$$

Since Corollary 3.6 ensures that $\nabla f(x) = 0$, we have

$$0 \le f(x + \Delta_k d) - f(x) = \tfrac{1}{2} \Delta_k^2 d^T \nabla^2 f(x) d + O(\Delta_k^3),$$

or $d^T \nabla^2 f(x) d \ge O(\Delta_k)$ for all $d \in D(x)$ and for all sufficiently large $k \in K$. The result is obtained by taking limits of both sides (in $K$) and noting that $D(x)$ must contain $V$. ■

Theorem 5.9. Let $V$ be an orthonormal basis in $\mathbb{R}^n$. Let $x$ be the limit of a refining subsequence, and let $D(x)$ be the set of refining directions for $x$. Under Assumption A1, if $f$ is twice continuously differentiable at $x$ and $D_k \supseteq V \cup -V$ infinitely often in the subsequence, then $f$ satisfies a pseudo-second order necessary condition for $V$ at $x$. Furthermore, the sum of the eigenvalues of $\nabla^2 f(x)$ must be nonnegative.

Proof. Since $D(x) \subseteq D$ is finite, it must contain $V \cup -V$, and the result follows directly from Corollary 5.4 and Definition 5.6. The final result follows directly from the symmetry of $\nabla^2 f(x)$ and Theorem 5.7. ■

The significance of Theorem 5.9 is that, if $f$ is sufficiently smooth, then the choice of orthonormal mesh directions at each iteration will ensure that the pseudo-second order necessary condition is satisfied, and that the sum of the eigenvalues of $\nabla^2 f(x)$ will be nonnegative. Thus, under the assumptions, GPS cannot converge to any saddle point whose Hessian has eigenvalues that sum to less than zero.

These saddle points (to which GPS cannot converge) are those which have sufficiently large regions (cones) of negative curvature. To see this, consider the contrapositive of Theorem 5.7 applied to the Hessian at the limit point; namely, if the sum of the eigenvalues of $\nabla^2 f(x)$ is negative, then for any orthonormal basis $V \subset \mathbb{R}^n$, at least one vector $v \in V$ must lie in a cone of negative curvature (i.e., $v^T \nabla^2 f(x) v < 0$). Since the angle between any two of these orthogonal directions is 90 degrees, there must be a cone of negative curvature with an angle greater than 90 degrees.

The  following  example  shows  that,  even  for  orthonormal  mesh  directions,  it  is 
still  possible  to  converge  to  a  saddle  point  -  even  when  not  starting  there.  It  also 
illustrates  our  assertion  about  cones  of  negative  curvature. 
Example 5.10. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be the twice continuously differentiable function defined by

$$f(x, y) = 99x^2 - 20xy + y^2 = (9x - y)(11x - y). \tag{5.4}$$

Choose $(x_0, y_0) = (1, 1)$ as the initial point, and set $D = \{e_1, e_2, -e_1, -e_2\}$, where $e_1$ and $e_2$ are the standard coordinate directions. Now observe that, at the saddle point $(0, 0)$, directions of negative curvature lie only in between the lines $y = 9x$ and $y = 11x$. Thus, to avoid the saddle point, the GPS sequence would have to include a point inside the narrow cone formed by these two lines, when sufficiently close to the origin. If the SEARCH step is empty, and the polling directions in $D$ are chosen consecutively in the POLL step (i.e., we poll in the order $e_1, e_2, -e_1, -e_2$), then the iteration sequence arrives exactly at the origin after 10 iterations and remains there because none of the directions in $D$ point inside of a cone of negative curvature. Figure 5.1 shows the cones of negative curvature for $f$ near the saddle point. Note that these cones, depicted in the shaded areas, are very narrow compared to those of positive curvature. Thus, for the search directions in $D$, it will be difficult to yield a trial point inside one of these cones.

On the other hand, if our objective function were $-f$, then the cones of negative curvature would be depicted by the non-shaded areas. In this case, Theorem 5.7 ensures that GPS cannot converge to the saddle point, since any set of $2n$ orthonormal directions would generate a trial point inside one of these cones, and thus a lower function value than that of the saddle point.
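The stalling behavior at the origin in Example 5.10 can be checked directly. The sketch below is an illustration under stated assumptions: it does not reproduce the GPS run from $(1, 1)$ (which depends on mesh-size details given elsewhere in the paper), and the helper names are hypothetical. It only verifies that the coordinate directions see positive curvature at the saddle point, that a poll at the origin fails at every tested step size, and that the cone between $y = 9x$ and $y = 11x$ is narrow.

```python
import math


def f(x, y):
    """Objective of Example 5.10, equation (5.4)."""
    return 99.0 * x * x - 20.0 * x * y + y * y


# Constant Hessian of f, computed by hand from (5.4).
HESSIAN = ((198.0, -20.0), (-20.0, 2.0))

# Polling directions e1, e2, -e1, -e2.
D = ((1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0))


def curvature(d):
    """Return d^T H d for the constant Hessian H."""
    return sum(d[i] * HESSIAN[i][j] * d[j] for i in range(2) for j in range(2))


def poll_fails_at_origin(step):
    """True if no direction in D improves on f(0, 0) at this step size."""
    return all(f(step * dx, step * dy) >= f(0.0, 0.0) for dx, dy in D)


def negative_cone_angle_degrees():
    """Opening angle of the cone between the lines y = 9x and y = 11x."""
    return math.degrees(math.atan(11.0) - math.atan(9.0))


if __name__ == "__main__":
    print("curvature along D:", [curvature(d) for d in D])
    print("curvature along (1, 10):", curvature((1.0, 10.0)))
    print("trace of Hessian:", HESSIAN[0][0] + HESSIAN[1][1])
    print("poll fails for steps 2^-k:",
          all(poll_fails_at_origin(2.0 ** -k) for k in range(20)))
    print("cone angle (degrees):", negative_cone_angle_degrees())
```

The direction $(1, 10)$ lies between the two lines and has negative curvature, so the origin is indeed a saddle point, yet the Hessian trace is positive, which is consistent with Theorem 5.9 not excluding this point.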


[Figure: Cones of Negative Curvature]

Fig. 5.1. For $f(x, y) = (9x - y)(11x - y)$, the cones of negative curvature at the saddle point $(0, 0)$ are shown in the shaded area between the lines $y = 9x$ and $y = 11x$.


5.2. Results for Linearly Constrained Problems. We now treat the linearly constrained problem given in (1.1). At this point, we note that there are two equivalent formulations for the classical Karush-Kuhn-Tucker (KKT) first-order necessary conditions for optimality, one of which is imbedded in Theorem 3.5. It states that a point $x^*$ satisfies the first-order necessary conditions if $\nabla f(x^*)^T w \ge 0$ for all directions $w$ in the tangent cone $T_X(x^*)$ at $x^*$, and $-\nabla f(x^*)$ lies in the normal cone $N_X(x^*)$ at $x^*$. However, since we do not have such a straightforward description of a second order necessary condition in this form, we now give a more traditional form of the KKT necessary conditions, from which we will be able to establish a sensible pseudo-second order condition. The following lemma, given without proof, is taken from a well-known textbook [26].

Lemma 5.11. If $x^*$ is a local solution of (1.1), then for some vector $\lambda$ of Lagrange multipliers,

1. $\nabla f(x^*) = A^T \lambda$, or equivalently, $W^T \nabla f(x^*) = 0$,

2. $\lambda \ge 0$,

3. $\lambda^T (Ax^* - b) = 0$,

4. $W^T \nabla^2 f(x^*) W$ is positive semi-definite,

where the columns of $W$ form a basis for the null-space of the active constraints at $x^*$.
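To make the conditions of Lemma 5.11 concrete, the sketch below checks conditions 1 through 3 and forms the reduced Hessian of condition 4 for a small hypothetical problem: minimize $x^2 + y^2$ subject to $x \ge 1$, with candidate solution $(1, 0)$ and multiplier $\lambda = 2$. The problem data, the function names, and the tolerance are all assumptions chosen for illustration; they do not appear in the original text.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def kkt_first_order(grad, A, b, x, lam, tol=1e-10):
    """Check conditions 1-3 of Lemma 5.11 for the constraints A x >= b."""
    n = len(x)
    # Condition 1: grad f(x*) = A^T lambda.
    at_lam = [sum(A[i][j] * lam[i] for i in range(len(A))) for j in range(n)]
    stationarity = all(abs(g - a) <= tol for g, a in zip(grad, at_lam))
    # Condition 2: lambda >= 0.
    nonnegative = all(l >= -tol for l in lam)
    # Condition 3: lambda^T (A x* - b) = 0.
    residual = [r - bi for r, bi in zip(matvec(A, x), b)]
    complementary = abs(sum(l * r for l, r in zip(lam, residual))) <= tol
    return stationarity and nonnegative and complementary


def reduced_hessian(H, W):
    """Return W^T H W, where the columns of W are the basis vectors."""
    n, p = len(W), len(W[0])
    return [[sum(W[i][a] * H[i][j] * W[j][c] for i in range(n) for j in range(n))
             for c in range(p)] for a in range(p)]


if __name__ == "__main__":
    # Hypothetical problem: minimize x^2 + y^2 subject to x >= 1.
    A, b = [[1.0, 0.0]], [1.0]
    x_star = [1.0, 0.0]
    grad = [2.0 * x_star[0], 2.0 * x_star[1]]   # gradient of x^2 + y^2
    hessian = [[2.0, 0.0], [0.0, 2.0]]
    W = [[0.0], [1.0]]                          # null space of the active row of A
    print(kkt_first_order(grad, A, b, x_star, [2.0]))
    print(reduced_hessian(hessian, W))
```

Here the reduced Hessian is a single positive number, so condition 4 holds as well; in larger problems one would test it for positive semidefiniteness with an eigenvalue or Cholesky routine.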

The first three conditions of Lemma 5.11 are generally referred to as first-order necessary conditions, while the last is the second order necessary condition. Convergence of a subsequence of GPS iterates to a point satisfying first-order conditions has been proved previously [7, 22] and is summarized in Theorem 3.5. Based on the second order condition, we now provide a pseudo-second order necessary condition for linearly constrained problems that is analogous to that given in Definition 5.6 for unconstrained problems.

Definition 5.12. For the optimization problem given in (1.1), let $W$ be an orthonormal basis for the null space of the binding constraints at $x^*$, where $x^*$ satisfies the KKT first-order necessary optimality conditions, and $f$ is twice continuously differentiable at $x^*$. Then $f$ is said to satisfy a pseudo-second order necessary condition for $W$ at $x^*$ if

$$w^T \nabla^2 f(x^*) w \ge 0 \quad \forall\, w \in W. \tag{5.5}$$


The  following  theorem  shows  that  the  condition  given  in  (5.5)  has  an  equivalent 
reduced  Hessian  formulation  similar  to  Definition  5.6.  It  is  formulated  to  be  a  general 
linear  algebra  result,  independent  of  the  GPS  algorithm. 

Theorem 5.13. Let $B \in \mathbb{R}^{n \times n}$ be symmetric, and let $W \in \mathbb{R}^{n \times p}$ be a matrix with orthonormal columns $\{w_i\}_{i=1}^p$, where $p \le n$. Then the following two statements are equivalent.

1. $w_i^T B w_i \ge 0$, $i = 1, 2, \ldots, p$,

2. There exists a matrix $Y$ whose columns $\{y_i\}_{i=1}^p$ form an orthonormal basis for $\mathbb{R}^p$ such that $y_j^T W^T B W y_j \ge 0$, $j = 1, 2, \ldots, p$.

Proof. Suppose $w_i^T B w_i \ge 0$, $i = 1, 2, \ldots, p$. Then, since $w_i = W e_i$, we have $e_i^T W^T B W e_i \ge 0$, $i = 1, 2, \ldots, p$, and the result holds since the coordinate vectors $\{e_i\}_{i=1}^p$ are orthonormal.

Conversely, suppose there exists $Y \in \mathbb{R}^{p \times p}$ such that $y_j^T W^T B W y_j \ge 0$, $j = 1, 2, \ldots, p$. Let $Z = WY$ with columns $\{z_i\}_{i=1}^p$. Then for $i = 1, 2, \ldots, p$, we have $z_i^T B z_i = y_i^T W^T B W y_i \ge 0$. Furthermore, the columns of $Z$ are orthonormal, since $z_i^T z_j = (W y_i)^T (W y_j) = y_i^T W^T W y_j = y_i^T y_j = \delta_{ij}$ (the last step by the orthogonality of $Y$). ■
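The construction $Z = WY$ used in the second half of this proof is easy to check numerically. The sketch below uses hypothetical data (a symmetric $3 \times 3$ matrix $B$, a $3 \times 2$ matrix $W$ with orthonormal columns, and a $2 \times 2$ rotation $Y$) and confirms that the columns of $Z$ are orthonormal and that $z_i^T B z_i = y_i^T W^T B W y_i$. All names and numbers are assumptions chosen for illustration.

```python
import math


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def column_curvatures(B, Z):
    """Return the list of z_i^T B z_i over the columns z_i of Z."""
    M = matmul(transpose(Z), matmul(B, Z))
    return [M[i][i] for i in range(len(M))]


# Hypothetical data: symmetric B and W with orthonormal columns.
B = [[2.0, 1.0, 0.0],
     [1.0, -1.0, 3.0],
     [0.0, 3.0, 0.5]]
r = 1.0 / math.sqrt(2.0)
W = [[r, 0.0],
     [r, 0.0],
     [0.0, 1.0]]

# Y is a rotation, so its columns form an orthonormal basis for R^2.
theta = 0.7
Y = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

Z = matmul(W, Y)                               # Z = W Y
gram = matmul(transpose(Z), Z)                 # should be the 2x2 identity
reduced = matmul(transpose(W), matmul(B, W))   # W^T B W
lhs = column_curvatures(B, Z)                  # z_i^T B z_i
rhs = column_curvatures(reduced, Y)            # y_i^T W^T B W y_i

if __name__ == "__main__":
    print(gram)
    print(lhs, rhs)
```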

Assumptions A2-A3 ensure that the GPS algorithm chooses directions that conform to $X$ [7, 21, 22]. This means that the finite set $T$ of all tangent cone generators for all points $x \in X$ must be a subset of $D$, and that if an iterate $x_k$ is within $\epsilon > 0$ of a constraint boundary, then certain directions in $T$ must be included in $D_k$. An algorithm that identifies these directions $T_k \subseteq T$, in the non-degenerate case, is given in [22], where it is noted that $T_k$ is chosen so as to contain a (positive) basis for the null-space of the $\epsilon$-active constraints at $x_k$. Thus, the set of refining directions $D(x)$ will always contain tangent cone generators at $x$, a subset of which forms a basis for the null-space of the active constraints at $x$. We will denote this null-space by $\mathcal{N}(\hat{A})$, where $\hat{A}$ is the matrix obtained by deleting the rows of $A$ corresponding to the non-active constraints at $x$.

However, in order to exploit the theory presented here, we require the following additional assumption so that $D(x)$ will always contain an orthonormal basis for $\mathcal{N}(\hat{A})$.

A4: The algorithm that computes the tangent cone generators at each iteration includes an orthonormal basis for the null-space of the $\epsilon$-active constraints at each iterate.

Furthermore, since $\mathcal{N}(\hat{A})$ contains the negative of any vector in the space, we can prove convergence to a point satisfying a pseudo-second order necessary condition by including $T_k \cup -T_k$ in each set of directions $D_k$.

The next theorem establishes convergence of a subsequence of GPS iterates to a point satisfying a pseudo-second order necessary condition, similar to that of Theorem 5.8, under the fairly strong condition that convergence occurs in a finite number of steps.

Theorem 5.14. Let $V$ be an orthonormal basis for $\mathbb{R}^n$. Let $x$ be the limit of a refining subsequence, and let $D(x)$ be the set of refining directions for $x$. Under Assumptions A1-A4, if $f$ is twice continuously differentiable at $x$, and for all sufficiently large $k$, $D_k \supseteq V$ and $x_k = x$, then $f$ satisfies a pseudo-second order necessary condition for $V$ at $x$.

Proof. For all $k \in K$ and $d \in D(x)$, we have $f(x_k + \Delta_k d) \ge f(x_k)$. Furthermore, for all sufficiently large $k \in K$, since $x_k = x$, a simple substitution yields $f(x + \Delta_k d) \ge f(x)$ for all $d \in D(x)$. For each $d \in D(x)$, Taylor’s Theorem yields

$$f(x + \Delta_k d) = f(x) + \Delta_k d^T \nabla f(x) + \tfrac{1}{2} \Delta_k^2 d^T \nabla^2 f(x) d + O(\Delta_k^3).$$

For $d \in \mathcal{N}(\hat{A})$, Lemma 5.11 ensures that $d^T \nabla f(x) = 0$, and thus

$$0 \le f(x + \Delta_k d) - f(x) = \tfrac{1}{2} \Delta_k^2 d^T \nabla^2 f(x) d + O(\Delta_k^3),$$

or $d^T \nabla^2 f(x) d \ge O(\Delta_k)$ for all $d \in D(x) \cap \mathcal{N}(\hat{A})$ and for all sufficiently large $k \in K$. The result is obtained by taking limits of both sides (in $K$), since $D(x)$ must contain an orthonormal basis for $\mathcal{N}(\hat{A})$. ■

In the theorem that follows, we show that, given sufficient smoothness of $f$, if mesh directions are chosen in a fairly standard way, a subsequence of GPS iterates converges to a point satisfying a pseudo-second order necessary condition. The theorem is similar to Theorem 5.9. Once again, the corollary to this theorem identifies an entire class of saddle points to which GPS cannot converge.

Theorem 5.15. Let $V$ be an orthonormal basis for $\mathbb{R}^n$. Let $x$ be the limit of a refining subsequence, and let $D(x)$ be the set of refining directions for $x$. Under Assumptions A1-A4, if $f$ is twice continuously differentiable at $x$ and $D_k \supseteq V \cup -V \cup T_k \cup -T_k$ infinitely often in the subsequence, then $f$ satisfies a pseudo-second order necessary condition for $V$ at $x$.

Proof. From the discussion following Theorem 5.13, $D(x)$ contains an orthonormal basis $W$ for $\mathcal{N}(\hat{A})$. Since $D$ is finite, for infinitely many $k$, we have $-W \subseteq -T_k \subseteq D_k$, which means that $-W \subseteq D(x)$. Thus, $D(x) \supseteq W \cup -W$, where $W$ is an orthonormal basis for $\mathcal{N}(\hat{A})$, and the result follows from Corollary 5.4 and Definition 5.12. ■

Corollary 5.16. If the hypotheses of Theorem 5.15 hold, then the sum of the eigenvalues of the reduced Hessian $W^T \nabla^2 f(x) W$ is nonnegative, where the columns of $W$ form a basis for the null space of the active constraints at $x$.

Proof. Theorem 5.15 ensures that the pseudo-second order condition holds; i.e., $w^T \nabla^2 f(x) w \ge 0$ for all $w \in W$. Then for $i = 1, 2, \ldots, |W|$, $e_i^T W^T \nabla^2 f(x) W e_i \ge 0$, where $e_i$ denotes the $i$th coordinate vector in $\mathbb{R}^{|W|}$. Since $W^T \nabla^2 f(x) W$ is symmetric and $\{e_i\}_{i=1}^{|W|}$ forms an orthonormal basis for $\mathbb{R}^{|W|}$, the result follows from Theorem 5.7. ■

6.  Concluding  Remarks.  Clearly,  the  class  of  GPS  algorithms  can  never  be 
guaranteed  to  converge  to  a  point  satisfying  classical  second  order  necessary  conditions 
for  optimality.  However,  we  have  been  able  to  show  the  following  important  results, 
which  are  surprisingly  stronger  than  what  has  been  proved  for  many  gradient-based 
(and  some  Newton-based)  methods: 

•  Under  mild  assumptions,  GPS  can  only  converge  to  a  local  maximizer  if  it 
does  so  in  a  finite  number  of  steps,  and  if  all  the  directions  used  infinitely 
often  are  directions  of  constant  function  value  at  the  maximizer  (Lemma  4.2, 
Theorem  4.3). 

•  Under  mild  assumptions,  GPS  cannot  converge  to  or  stall  at  a  strict  local 
maximizer  (Corollary  4.4). 

•  If $f$ is sufficiently smooth and mesh directions contain an orthonormal basis and its negatives, then a subsequence of GPS iterates converges to a point satisfying a pseudo-second order necessary condition for optimality (Theorems 5.9 and 5.15).

•  If $f$ is sufficiently smooth and mesh directions contain an orthonormal basis and its negatives, then GPS cannot converge to a saddle point at which the sum of the eigenvalues of the Hessian (or reduced Hessian) is negative (Theorem 5.9 and Corollary 5.16).

Thus an important characteristic of GPS is that, given reasonable assumptions, the likelihood of converging to a point that does not satisfy second order necessary conditions is small.